Skip to content

Extend check namespace usage #38347

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

kevinetienne
Copy link

@kevinetienne kevinetienne commented Dec 7, 2020

TLDR: This code changes might be not in a mergeable state as false positives remain when updating the pattern.

The change is a bit bigger to what is described in #38093 as more files needed to be fixed, which is fine I think.

The stack displaying the files to fix might be limited to a few items (9 items in my case) when running the command pre-commit run inconsistent-namespace-usage --all-files. Every time a set of files was fixed a few others appeared which is why I ended fixing up more files and why my list of false positive is not exhaustive.

I have committed the change to the PATTERN in a separate commit as we have false positives. Such as the following files which contain:

  1. a similar function in the numpy package (ex: np.array see
    values = np.array(values, dtype=object) # object dtype to avoid casting
    )
  2. a similar function in the pyarrow package (ex: pyarrow.array see
    "a": pyarrow.array([1, 2, 3, None], "int64"),
    )
  3. a function can be called on dataframe or a serie (ex: df.isna and aggr.isna see
    aggr[aggr.isna()] = "missing"
    and
    result = df.groupby(["A", "B"]).agg(lambda df: df.isna().sum())
  4. or a function is called with any prefixes (see the few calls to factorize in
    result_codes, result_uniques = obj.factorize(sort=sort)
    )
  5. eval vs pd.eval (see https://github.com/pandas-dev/pandas/blob/862cd05df4452592a99dd1a4fa10ce8cfb3766f7/pandas/tests/computation/test_eval.py)

I'm probably missing a few other cases but 1) 2) and 3) can be added in the negative lookahead (ignore function/methods which start with np/pyarrow/df). However the regex might grow bigger and bigger. 5) can be ignored entirely either within the regex or with an exclusion list.

Let me know you're thinking as I see three solutions:

  • update the regex and maybe add an exclusion list (for case like eval) and see where that leads
  • remove last commit for PATTERN and create another ticket to address the other cases
  • close the ticket and find/investigate another solution (ex: separate class vs function check/check import instead...)

@MarcoGorelli MarcoGorelli self-requested a review December 8, 2020 15:03
@MarcoGorelli
Copy link
Member

eval vs pd.eval

I hadn't thought of that...not entirely sure what the best way to go about this is but I'll get back to you on this

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Dec 16, 2020

Hey @kevinetienne

Sorry this taken me a bit to get back to you on. I've looked into it further and a few suggestions:

  • let's make a blocklist in which to put the (few) cases we want to ignore, such as eval and pd.eval
  • instead of checking that the pattern doesn't start with pd., let's just check it doesn't start with .

I think that applying the following diff should get us a good way there:

diff --git a/scripts/check_for_inconsistent_pandas_namespace.py b/scripts/check_for_inconsistent_pandas_namespace.py
index b213d931e7..908a056fdc 100644
--- a/scripts/check_for_inconsistent_pandas_namespace.py
+++ b/scripts/check_for_inconsistent_pandas_namespace.py
@@ -16,20 +16,20 @@ from typing import Optional, Sequence
 
 PATTERN = r"""
     (
-        (?<!pd\.)(?<!\w)    # check class_name doesn't start with pd. or character
-        ([A-Z]\w+)\(        # match DataFrame but not pd.DataFrame or tm.makeDataFrame
-        .*                  # match anything
-        pd\.\2\(            # only match e.g. pd.DataFrame
+        (?<![\.\w])    # check class_name doesn't start with . or character
+        (\w+)\(        # match DataFrame but not pd.DataFrame or tm.makeDataFrame
+        .*             # match anything
+        pd\.\2\(       # only match e.g. pd.DataFrame
     )|
     (
-        pd\.([A-Z]\w+)\(    # only match e.g. pd.DataFrame
-        .*                  # match anything
-        (?<!pd\.)(?<!\w)    # check class_name doesn't start with pd. or character
-        \4\(                # match DataFrame but not pd.DataFrame or tm.makeDataFrame
+        pd\.(\w+)\(    # only match e.g. pd.DataFrame
+        .*             # match anything
+        (?<![\.\w])    # check class_name doesn't start with . or character
+        \4\(           # match DataFrame but not pd.DataFrame or tm.makeDataFrame
     )
     """
 ERROR_MESSAGE = "Found both `pd.{class_name}` and `{class_name}` in {path}"
-
+BLOCK_LIST = {'eval'}
 
 def main(argv: Optional[Sequence[str]] = None) -> None:
     parser = argparse.ArgumentParser()
@@ -45,11 +45,11 @@ def main(argv: Optional[Sequence[str]] = None) -> None:
         match = pattern.search(contents)
         if match is None:
             continue
-        if match.group(2) is not None:
+        if match.group(2) is not None and match.group(2).decode() not in BLOCK_LIST:
             raise AssertionError(
                 ERROR_MESSAGE.format(class_name=match.group(2).decode(), path=str(path))
             )
-        if match.group(4) is not None:
+        if match.group(4) is not None and match.group(4).decode() not in BLOCK_LIST:
             raise AssertionError(
                 ERROR_MESSAGE.format(class_name=match.group(4).decode(), path=str(path))
             )

Also, there's a few conflicts, could you please fetch and merge upstream/master?

@jreback
Copy link
Contributor

jreback commented Jan 3, 2021

@kevinetienne pls merge master. this will need to be merged soon after otherwise likely will have further conflicts. ping on green.

@jreback jreback added the Code Style Code style, linting, code_checks label Jan 3, 2021
@jreback
Copy link
Contributor

jreback commented Jan 3, 2021

@kevinetienne would prefer that you limit scope to this to a single type of change (and instead use multiple PRs) as large PRs are very likley to take a long time to review (unless they are trivial changes but that is not the case here)

@kevinetienne
Copy link
Author

I'm closing this PR and I'll update the main ticket later. I'll break this PR in several parts. It would be easier for me to deliver too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Code Style Code style, linting, code_checks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants